SYSTEM AND METHOD FOR RESCORING N-BEST HYPOTHESES
OF AN AUTOMATIC SPEECH RECOGNITION SYSTEM



GOVERNMENT LICENSE RIGHTS 

This invention was developed under United States 
Government ARPA Contract No. MDA 972-97-C0012. The United
States Government has certain rights to the invention. 

BACKGROUND 

1. Technical Field:

The present invention relates generally to speech 
recognition and, more particularly, to a system and method 
for rescoring N-best hypotheses output from an automatic 
speech recognition system by utilizing an independently 
derived text-to-speech (TTS) system to generate a synthetic 
waveform for each N-best hypothesis and comparing each 
synthetic waveform with the original speech waveform to 
select the final system output. 

2. Description of Related Art:

A common technique which is utilized in speech 
recognition is to first produce a list of the N most-likely 
("N-best") hypotheses for each utterance and then rescore 
each of the N-best hypotheses using one or more knowledge 
sources not necessarily modeled by the speech recognition
system which produced the N-best hypotheses. 
Advantageously, this "N-best rescoring" method enables 
additional knowledge sources to be brought to bear on the 
recognition task without having to integrate such sources
into the initial decoding system. 

One such "N-best rescoring" method is disclosed in 
"An Articulatory-Like Speech Production Model with 
Controlled Use of Prior Knowledge" by R. Bakis, Frontiers in
Speech, CD-Rom, 1993. With this method, an articulatory
model which generates acoustic vectors (not speech 
waveforms) given a phonetic transcription is utilized to 
produce acoustics against which the original speech may be 
compared. Other "rescoring" methods are known to those
skilled in the art.

As is understood by those skilled in the art, the 
techniques utilized for speech recognition and speech 
synthesis are inherently related. Consequently, increased 
knowledge and understanding and subsequent improvements for
one technique can have profound implications for the other.
Due to the recent advances in text-to-speech (TTS) systems 
which have enabled high quality synthesis, it is to be 
appreciated that a TTS system can provide a useful
source of knowledge about what the speech signal associated
with each of the N-best hypotheses would look like. Currently,
there exist no known systems or methods which utilize a TTS
system for rescoring N-best hypotheses. Therefore, based on 
the similarities between speech recognition and speech 
synthesis, it is desirable to employ a TTS system as a
knowledge source for use in rescoring N-best hypotheses. 

SUMMARY OF THE INVENTION 

The present invention is directed to a system and 
method for rescoring N-best hypotheses of an automatic 
speech recognition system, wherein the N-best hypotheses
comprise the N most likely text sequences of a decoded 
original waveform. In one aspect of the present invention, 
a method for rescoring N-best hypotheses comprises the steps 
of: 

generating a synthetic waveform for each of the N
text sequences; 

comparing each synthetic waveform with the 
original waveform to determine the synthetic waveform that 
is closest to the original waveform; and

selecting for output the text sequence
corresponding to the synthetic waveform determined to be
closest to the original waveform.

In another aspect of the present invention, in
order to compare the original and synthetic waveforms, each 
is transformed into a set of feature vectors using the same 
feature analysis process.

In another aspect of the present invention, the
original waveform and each of the synthetic waveforms
representing the N-best hypotheses are compared on a
phoneme-by-phoneme basis by segmenting (aligning) the stream
of feature vectors into contiguous regions, each region
representing the physical representation of one phoneme in
the phonetic expansion of the hypothesized text sequence.

In another aspect of the present invention, an 
automatic speech recognition system comprises: 

a decoder for decoding an original waveform of 
acoustic utterances to produce N text sequences, the N text
sequences representing N-best hypotheses of the decoded 
original waveform; 

a waveform generator for generating a synthetic 
waveform for each of the N text sequences; and

a comparator for comparing each synthetic waveform
with the original waveform to rescore the N-best hypotheses. 

Advantageously, by comparing the synthetic 
waveforms (for each of the N most-likely text sequences) to 
the original waveform, one can incorporate the body of 
knowledge and understanding required to build the synthesis
model into the N-best framework for rescoring the top N
hypotheses.

These and other aspects, features and advantages 
of the present invention will be described and become
apparent from the following detailed description of 
preferred embodiments, which is to be read in connection 
with the accompanying drawings. 



BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block/flow diagram of a system/method 
for rescoring N-best hypotheses in accordance with an 
embodiment of the present invention; and 

Figs. 2A and 2B comprise a detailed flow diagram 
of a method for rescoring N-best hypotheses in accordance
with one aspect of the present invention. 



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

It is to be understood that the system and method 
described herein may be implemented in various forms of 
hardware, software, firmware, special purpose
microprocessors, or a combination thereof. Preferably, the 
present invention is implemented in software as an 



YO999-046 PJO (8728-252) 



application program tangibly embodied on a program storage 
device. The application program may be uploaded to, and 
executed by, a machine having any suitable and preferred 
microprocessor architecture. Preferably, the machine is 
implemented on a computer platform having hardware such as
one or more central processing units (CPU), a random access 
memory (RAM), and input/output (I/O) interface(s). The
computer platform also includes an operating system and 
microinstruction code. The various processes and functions 
described herein may either be part of the microinstruction
code or part of the application program (or a combination 
thereof) which is executed via the operating system. In 
addition, various other peripheral devices may be connected 
to the computer platform such as an additional data storage 
device and a printing device.

It is to be further understood that, because some 
of the constituent system components and method steps 
depicted in the accompanying Figures are preferably 
implemented as software modules, the actual connections 
between the system components (or the process steps) may
differ depending upon the manner in which the present 
invention is programmed. Given the teachings herein, one of 
ordinary skill in the related art will be able to 
contemplate these and similar implementations or
configurations of the present system and method. 

Referring now to Fig. 1, a block diagram 
illustrates a system for rescoring N-best hypotheses of an 
automatic speech recognition system in accordance with an 
embodiment of the present invention. It is to be understood 
that the diagram depicted in Fig. 1 can also be considered a 
general flow diagram of a method for rescoring N-best 
hypotheses in accordance with the present invention. The 
system 100 includes a feature analysis module 101 which 
receives and digitizes input speech waveforms (spoken 
utterances), and transforms the digitized input waveforms 
into a set of feature vectors on a frame-by-frame basis
using feature extraction techniques known by those skilled 
in the art. Typically, the feature extraction process 
involves computing spectral or cepstral components and 
corresponding dynamics such as first and second derivatives. 
Preferably, the feature analysis module 101 operates by 
first producing a 24-dimensional cepstral feature vector for
every 10ms of the input waveform, splicing nine frames 
together (i.e., concatenating the four frames to the left 
and four frames to the right of the current frame) to 
augment the current vector of cepstra, and then reducing 
each augmented cepstral vector to a 60-dimensional feature 
vector using linear discriminant analysis. The input
(original) waveform feature vectors are then stored for 
subsequent processing as discussed below. 
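
By way of illustration only, the frame-splicing and dimensionality-reduction steps described above may be sketched as follows. The function names `splice_frames` and `project`, the edge-padding rule (repeating the first and last frames), and the projection matrix supplied by the caller are assumptions made for this sketch. The linear discriminant analysis matrix itself would be estimated separately and is not derived here.

```python
from typing import List, Sequence

Vector = List[float]


def splice_frames(frames: Sequence[Vector], context: int = 4) -> List[Vector]:
    """Concatenate each frame with `context` frames on each side.

    Edge frames are padded by repeating the first or last frame. This
    padding rule is an assumption; the text does not specify edge handling.
    """
    last = len(frames) - 1
    spliced: List[Vector] = []
    for t in range(len(frames)):
        window: Vector = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)  # clamp to a valid frame index
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


def project(vectors: Sequence[Vector], matrix: Sequence[Vector]) -> List[Vector]:
    """Apply a linear projection; each row of `matrix` yields one output dimension."""
    return [[sum(w * x for w, x in zip(row, v)) for row in matrix] for v in vectors]
```

With 24-dimensional cepstral frames and a context of four frames on each side, each spliced vector has 9 x 24 = 216 components, which a 60 x 216 matrix would reduce to 60 dimensions.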

The original waveform feature vectors are then 
decoded by a speech recognition system 102 having trained
acoustic prototypes to recognize and transcribe the spoken 
words of the original waveform. In particular, the speech 
recognition system 102 is configured to generate N-best 
hypotheses 103 (i.e., the N most-likely text sequences
(transcriptions) of the spoken utterances). It is to be
understood that any conventional technique may be employed 
in the speech recognition system 102 for generating the 
N-best hypotheses such as the method disclosed in "The 
N-Best Algorithm: An Efficient and Exact Procedure For
Finding the N Most Likely Sentence Hypotheses" by Schwartz,
et al., Proc. ICASSP, pp. 81-84, 1990.

The N-best hypotheses 103 are input to a 
text-to-speech system (TTS) 104 to generate a set of N 
synthetic waveforms 105, each synthetic waveform being
generated from the text sequence of one of the N-best
hypotheses 103. It is to be understood that any conventional TTS
system may be employed for implementing the present 
invention, although the preferred TTS system is 
International Business Machines' (IBM) trainable
text-to-speech system disclosed in U.S. Patent Application
Serial No. 09/084,679, entitled: "Methods For Generating 
Pitch And Duration Contours In A Text To Speech System," 
which is commonly assigned and incorporated herein by 
reference.

Briefly, with the IBM TTS system, the 
pronunciation of each word capable of being synthesized is 
characterized by its entry in a phonetic dictionary, with 
each entry comprising a string of phonemes which constitute 
the corresponding word. The TTS system concatenates 
segments of speech from phonemes in context to produce 
arbitrary sentences. A flat pitch equal to a training 
speaker's average pitch value is utilized to synthesize each 
segment. The duration of each segment is selected as the 
average duration of the segment in the training corpus plus 
a user-specified constant a times the standard deviation of 
the segment. The a term serves to control the rate of the 
synthesized speech and is fixed at a moderate value for all 
our experiments. The TTS system is built from data spoken 
by one male speaker who read 450 sentences of text. In 
operation, the IBM TTS system receives user-selected text 
sentence(s) and expands each word into a string of
constituent phonemes by utilizing the synthesis dictionary. 
Next, waveform segments for each phoneme are retrieved from
storage and concatenated. The details of the procedure by 
which the waveform segments are chosen are described in the 
above-incorporated application. The pitch of the synthesis
waveform is adjusted to be flat using the pitch synchronous
overlap and add (PSOLA) technique, which is also described 
in the above-incorporated application. The N synthetic 
waveforms are then saved to disk. 

Each of the N synthetic waveforms 105 is input to
the feature analysis module 101 and subjected to the same
feature analysis as discussed above (for processing the 
original speech waveform) to generate N sets of feature 
vectors, with each set of feature vectors representing a 
corresponding one of the N synthetic waveforms 105. The N
sets of feature vectors may be stored for subsequent
processing. It is to be understood that for purposes of 
illustration and clarity, the system of Fig. 1 is shown as 
having two feature analysis modules 101, although the system 
is preferably implemented using one feature analysis module
for processing both the original and synthetic waveforms.

A rescore module 106 compares the original 
waveform feature vectors with each of the N sets of 
synthetic waveform feature vectors and corresponding N-best 
text sequences to provide an N-best rescore output 110. In 
particular, this comparison process begins in alignment
module 107, whereby the original waveform feature vectors
and each of the N sets of synthetic waveform feature vectors
are aligned to the text sequence of the corresponding N-best
hypothesis. A distance computation module 108 calculates
the distance between the original waveform and each of the N
synthetic waveforms (using methods known to those skilled in
the art). A comparator module 109 compares each of the
calculated distances to rescore the N-best hypotheses based
on the computed distances and determine the smallest
distance. The N-best text sequence corresponding to the
closest synthetic waveform to the original speech is then 
output or otherwise saved as the final transcription of the 
utterance (i.e., the N-best rescore output 110). 
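
As a non-limiting sketch, the data flow of Fig. 1 may be expressed as a single function in which each module is passed in as a callable. The names `rescore_n_best`, `synthesize`, `analyze` and `distance` are hypothetical placeholders for the TTS system 104, the feature analysis module 101, and the alignment and distance modules 107 and 108, respectively; none of them is defined by the present description.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[List[float]]


def rescore_n_best(
    original: Features,
    hypotheses: Sequence[str],
    synthesize: Callable[[str], List[float]],               # placeholder for TTS system 104
    analyze: Callable[[List[float]], Features],             # placeholder for module 101
    distance: Callable[[Features, Features, str], float],   # placeholder for modules 107/108
) -> List[Tuple[float, str]]:
    """Return (distance, text) pairs ordered from closest to farthest."""
    scored: List[Tuple[float, str]] = []
    for text in hypotheses:
        waveform = synthesize(text)    # synthetic waveform for this hypothesis
        synthetic = analyze(waveform)  # same feature analysis as the original
        scored.append((distance(original, synthetic, text), text))
    # The sort is stable, so hypotheses with equal distances keep their
    # original N-best order.
    scored.sort(key=lambda pair: pair[0])
    return scored
```

Under these assumptions, the first entry of the returned list plays the role of the N-best rescore output 110.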

Referring now to Figs. 2A and 2B, a flow diagram 
illustrates a preferred method for rescoring N-best 
hypotheses of an automatic speech recognition system in 
accordance with the present invention. Specifically, the 
flow diagram of Figs. 2A and 2B illustrates a detailed 
comparison process which is preferably employed in the 
rescore module 106 of Fig. 1. Initially, the rescore module 
106 retrieves the original waveform feature vectors from 
memory (step 200). The comparison process is then
initialized by setting a parameter N = 1 (where N represents
the Nth-best hypothesis (text sequence) output from the 
speech recognition system 102) and setting a parameter "Best 
Distance" = infinity (where "Best Distance" is a threshold 
value that represents the smallest computed distance measure 
of previous iterations) (step 201).

Next, the Nth-best text sequence and the
corresponding Nth synthetic waveform feature vectors are
retrieved from memory (step 202). The original
waveform feature vectors and the Nth synthetic waveform
feature vectors are then time-aligned to the Nth-best text
sequence at the phoneme level (step 203). The alignment
procedure preferably employs a Viterbi alignment process
such as disclosed in "The Viterbi Algorithm," by G.D.
Forney, Jr., Proc. IEEE, vol. 61, pp. 268-278, 1973. In
particular, as is understood by those skilled in the art,
the Viterbi alignment finds the most likely sequence of
states given the acoustic observations, where each state is
a sub-phonetic unit and the probability density function of
the observations is modeled as a mixture of 60-dimensional
Gaussians. It is to be appreciated that by time-aligning
the original waveform and the Nth synthesized waveform to
the Nth hypothesized text sequence at the phoneme level,
each waveform may be segmented into contiguous time regions,
with each region mapping to one phoneme in the phonetic
expansion of the Nth text sequence (i.e., a segmentation of
each waveform into phonemes).
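
For purposes of illustration, a much-simplified forced alignment may be sketched as follows. This is a stand-in rather than the alignment process described above: it assumes one state per phoneme instead of sub-phonetic states, and it assumes the per-frame log-likelihoods are supplied by the caller rather than computed from Gaussian mixtures. The name `forced_align` and the layout of `loglik` are hypothetical.

```python
from typing import List, Sequence, Tuple


def forced_align(loglik: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return one (start, end) frame range per phoneme, with end exclusive.

    loglik[t][m] is the log-likelihood of frame t under the m-th phoneme of
    the hypothesis. Phonemes are visited in order and each covers at least
    one frame.
    """
    num_frames, num_phones = len(loglik), len(loglik[0])
    if num_frames < num_phones:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * num_phones for _ in range(num_frames)]
    entered = [[False] * num_phones for _ in range(num_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, num_frames):
        for m in range(min(t + 1, num_phones)):
            stay = score[t - 1][m]
            advance = score[t - 1][m - 1] if m > 0 else neg_inf
            if advance > stay:
                score[t][m] = advance + loglik[t][m]
                entered[t][m] = True  # phoneme m begins at frame t
            else:
                score[t][m] = stay + loglik[t][m]
    # Trace back from the last frame of the last phoneme.
    bounds = [0] * (num_phones + 1)
    bounds[num_phones] = num_frames
    m = num_phones - 1
    for t in range(num_frames - 1, 0, -1):
        if entered[t][m]:
            bounds[m] = t
            m -= 1
    return [(bounds[i], bounds[i + 1]) for i in range(num_phones)]
```

Each returned range corresponds to one contiguous time region of the kind described above, i.e., the frames assigned to one phoneme of the hypothesized text sequence.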

After the alignment process, the mean of the
feature vectors (frames) which align to each phoneme is
computed for the original waveform and the Nth synthetic
waveform (step 204). In this manner, the original waveform
and the Nth synthetic waveform may be represented as a
collection of mean feature vectors, with each mean feature
vector representing the computed mean of all feature vectors
aligning to a corresponding phoneme in the Nth text
sequence. This process results in the generation of M mean
feature vectors representing the original waveform and M
mean feature vectors representing the Nth synthetic waveform
(where M represents the number of phonemes in the expansion
of the Nth text sequence into its constituent phonemes).
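
A minimal sketch of the per-phoneme averaging of step 204 follows. It assumes the segmentation is available as a list of (start, end) frame ranges with exclusive ends, one per phoneme; the name `phoneme_means` and that representation are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def phoneme_means(
    frames: Sequence[Vector], segments: Sequence[Tuple[int, int]]
) -> List[Vector]:
    """Return one mean feature vector per (start, end) segment, end exclusive."""
    means: List[Vector] = []
    for start, end in segments:
        region = frames[start:end]
        if not region:
            raise ValueError("each phoneme must align to at least one frame")
        dim = len(region[0])
        # Average each feature dimension over the frames of this phoneme.
        means.append([sum(f[d] for f in region) / len(region) for d in range(dim)])
    return means
```

Applying the function once to the original feature vectors and once to the Nth synthetic feature vectors yields the two sets of M mean feature vectors referred to above.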

Next, a distance measure between each phoneme mean
of the original waveform and the corresponding phoneme mean
of the Nth synthetic waveform is computed (step 205).
Although any suitable method may be employed for computing
the distance measure, a Euclidean distance is preferably
employed (by the distance computation module 108, Fig. 1).
These individual distance measures (between each
corresponding phoneme mean) are then summed to produce an
overall distance measure (step 206) representing the
"distance" between the original waveform and the Nth
synthetic waveform corresponding to the Nth text sequence.
Therefore, since the Nth synthetic waveform is derived from
the Nth-best text sequence, it is to be appreciated that the
overall distance measure indirectly represents the
"distance" between the original waveform and the Nth-best
text sequence.
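
Steps 205 and 206 may be sketched, under the assumption that both waveforms have already been reduced to M mean feature vectors each, as the sum of per-phoneme Euclidean distances. The function name `overall_distance` is hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def overall_distance(
    original_means: Sequence[Vector], synthetic_means: Sequence[Vector]
) -> float:
    """Sum the Euclidean distances between corresponding phoneme means."""
    if len(original_means) != len(synthetic_means):
        raise ValueError("both waveforms must have the same number of phonemes")
    # math.dist computes the Euclidean distance between two points.
    return sum(math.dist(o, s) for o, s in zip(original_means, synthetic_means))
```

Because the sum runs over phonemes rather than frames, a long phoneme and a short phoneme contribute on an equal footing in this sketch, consistent with the use of one mean vector per phoneme.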

A determination is then made as to whether the
"distance" (which represents the overall distance between
the original waveform and the Nth text sequence) is less
than the current "Best Distance" value (step 207). If the
"distance" is smaller than the "Best Distance" value
(affirmative determination in step 207), a parameter "Best
Text" is set so as to label the current Nth-best text
sequence as the most accurate transcription encountered as
compared to all previous iterations, and the parameter "Best
Distance" is set equal to the current "distance" value (step
208).

A determination is then made as to whether the
final N-best hypothesis has been considered (step
209). If there are additional N-best hypotheses (negative
determination in step 209), the parameter N is incremented
by one (step 210), and the next Nth-best text sequence and
Nth synthetic waveform are retrieved from memory (return to
step 202, Fig. 2A). This comparison process (steps 203-208)
is repeated for N iterations (to rescore each N-best
hypothesis). When it is determined that the final Nth-best
hypothesis has been rescored (affirmative determination in
step 209), the Nth-best text sequence having the minimum
distance to the original waveform (as indicated by the "Best
Text" and "Best Distance" parameters) is output (step 211).
After the final output (step 211), the user may choose to
rescore the N-best hypotheses of another original waveform
(affirmative result in step 212) in which case the desired
waveform will be retrieved from memory (return to step 200)
and processed as described above. Alternatively, the user
may terminate the rescore process and exit the program (step
213).
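
The iteration of Figs. 2A and 2B may be sketched as a running-minimum loop. The sketch assumes the overall distance for each hypothesis has already been computed (steps 203-206) and is passed in as a list; the function name `select_best_hypothesis` is hypothetical, and the step numbers in the comments indicate only an approximate correspondence.

```python
import math
from typing import Optional, Sequence, Tuple


def select_best_hypothesis(
    texts: Sequence[str], distances: Sequence[float]
) -> Tuple[Optional[str], float]:
    """Return the text with the smallest distance, and that distance.

    distances[n] is assumed to be the overall distance already computed
    for texts[n].
    """
    best_text: Optional[str] = None
    best_distance = math.inf                      # step 201
    for text, distance in zip(texts, distances):  # steps 202, 209 and 210
        if distance < best_distance:              # step 207
            best_text = text                      # step 208
            best_distance = distance
    return best_text, best_distance               # step 211
```

Because the comparison is a strict "less than", a later hypothesis replaces an earlier one only if it is strictly closer, so ties are resolved in favor of the hypothesis ranked higher by the decoder.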

The above described preferred embodiment has been 
tested on speech degraded by the inclusion of additive noise 
in the form of background music. Test results have 
indicated an improvement of the word error rate from 27.8 
percent to 27.3 percent using the two most-likely text 
hypotheses for each utterance. The improvement primarily 
results from a reduction in the number of erroneously 
inserted words. 

It is to be appreciated by those skilled in the
art that there is some flexibility within the general
framework of the present invention, thereby providing alternate
embodiments of the above-described preferred embodiment.
For instance, as noted above, different methods for
measuring the distance between the original and synthetic
waveforms may be substituted for the Euclidean distance
measure described above.

In another embodiment of the present invention, in 
addition to re-ordering the N-best list based strictly on 
the distance of each synthesized hypothesis to the original 
waveform, the distance may be combined with other scores 
reflecting our confidence in the correctness of the N-th 
hypothesis, such as the likelihood of that hypothesis as 
assessed by the individual components comprising the 
automatic speech recognition system: the acoustic model and 
the language model. By combining the distance score with 
the scores from these sources, information provided by the 
decoder may be considered in conjunction with the new 
information provided by the distance score. For example, 
the scores may be combined by forming the following sum: 

$S_N = -D_N + (a \cdot A_N) + (b \cdot L_N)$

where $D_N$ is the distance of the N-th hypothesis from the
original waveform (as described above); where $A_N$ is the
acoustic model score of the N-th hypothesis; where $L_N$ is the
language model score of the N-th hypothesis; and where a and
b are constants. The text selected for output can then be
the text associated with the $N_T$-th hypothesis, where $N_T$ is
the hypothesis whose score $S_{N_T}$ is the maximum score among
the N-best hypotheses considered.
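
The combined score may be illustrated with the following sketch. The `ScoredHypothesis` container and the function names are hypothetical, and the sketch assumes that the acoustic and language model scores are expressed so that larger values indicate more likely hypotheses (for example, log-likelihoods), which the description does not state explicitly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredHypothesis:
    text: str
    distance: float  # D_N, distance to the original waveform
    acoustic: float  # A_N, acoustic model score (assumed: higher is better)
    language: float  # L_N, language model score (assumed: higher is better)


def combined_score(h: ScoredHypothesis, a: float, b: float) -> float:
    """Compute S_N = -D_N + a * A_N + b * L_N for one hypothesis."""
    return -h.distance + a * h.acoustic + b * h.language


def select_by_combined_score(
    hypotheses: Sequence[ScoredHypothesis], a: float, b: float
) -> ScoredHypothesis:
    """Return the hypothesis whose combined score is the maximum."""
    return max(hypotheses, key=lambda h: combined_score(h, a, b))
```

The constants a and b control how much weight the decoder's own scores receive relative to the distance term; suitable values would have to be chosen empirically and are not given here.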

In yet another embodiment, the original speech 
and/or synthetic speech may be further processed to 
compensate for speaker-dependent variations. For example, a
vocal tract length normalization process (such as disclosed
in "A Parametric Approach to Vocal-Tract-Length 
Normalization", by Eide et al., Proceedings of the Fifteenth 
Annual Speech Research Symposium, Johns Hopkins University, 
1995; and "Speaker Normalization on Conversational Telephone
Speech", by Wegmann et al., Proc. ICASSP, Vol. 1, pp.
339-341, 1996) may be performed on each test utterance to 
warp the frequency axis for each test speaker to match the 
vocal-tract characteristics of the speaker from whose data 
the TTS system was built. This would reduce the component
in the distance between utterances due to differences
between the speaker of the original test utterance and the 
speaker of the TTS system, which causes a relative increase 
of the contribution to the distance scores due to phonetic 
differences between the utterances.

Although illustrative embodiments have been
described herein with reference to the accompanying
drawings, it is to be understood that the present system and
method is not limited to those precise embodiments, and that
various other changes and modifications may be effected
therein by one skilled in the art without departing from the 
scope or spirit of the invention. All such changes and 
modifications are intended to be included within the scope 
of the invention as defined by the appended claims.

WHAT IS CLAIMED IS:

1. A program storage device readable by a
machine, tangibly embodying a program of instructions
executable by the machine to perform method steps for
rescoring N-best hypotheses of a decoded original waveform
output from an automatic speech recognition system, the
N-best hypotheses comprising N text sequences, the method
steps comprising:

generating a synthetic waveform for each of the N
text sequences;

comparing each synthetic waveform with the
original waveform to determine the synthetic waveform that
is closest to the original waveform; and

selecting for output the text sequence
corresponding to the synthetic waveform determined to be
closest to the original waveform.

2. The program storage device of claim 1,
wherein the instructions for performing the comparing step
include instructions for performing the steps of:

aligning frames of the original waveform and
frames of each synthetic waveform to a corresponding one of
the N text sequences; and

calculating the distance between the original
waveform and each of the synthetic waveforms based on the
corresponding alignments.

3. The program storage device of claim 2,
wherein the instructions for performing the comparing step
further include instructions for:

retrieving feature vectors corresponding to the
original waveform; and

generating feature vectors for each synthetic
waveform such that the feature vectors for the synthetic
waveforms are similar in structure to the feature vectors of
the original waveform;

wherein the alignment is performed by
time-aligning the feature vectors of the original waveform
and the feature vectors of each synthetic waveform with the
corresponding one of the N text sequences.

4. The program storage device of claim 2,
wherein the alignment is performed using a Viterbi alignment
process.

5. The program storage device of claim 2,
wherein the alignment is performed on a phoneme level.

6. The program storage device of claim 2,
wherein the instructions for calculating the distance
include instructions for performing the steps of:

calculating an individual distance between each
aligned frame of the original waveform and each of the N
synthetic waveforms; and

summing the individual distances of the aligned
frames of the original waveform and each synthetic waveform.

7. The program storage device of claim 1,
wherein the instructions for performing the comparing step
include instructions for performing the steps of:

(a) setting a parameter N=1;

(b) retrieving the Nth synthetic waveform and the
corresponding Nth text sequence;

(c) time-aligning frames of the original waveform
and frames of the Nth synthetic waveform to corresponding
text of the Nth text sequence;

(d) computing an individual distance between each
corresponding aligned frame of the original and Nth
synthetic waveform;

(e) summing the individual distances to compute
the distance between the original waveform and the Nth
synthetic waveform;

(f) determining if the computed distance is less
than a current best distance value;

(g) setting the current best distance value equal
to the computed distance and saving the Nth text sequence
for consideration as the final output, if the computed
distance is determined to be less than the current best
distance threshold;

(h) incrementing the parameter N by one; and

(i) repeating steps (b) through (h) until each of
the N text sequences have been considered.

8. The program storage device of claim 7, 
wherein the instructions for performing the step of 
determining the individual distance (step d) include 
instructions for: 

computing a mean feature vector of all feature 
vectors comprising each aligned frame for both the original 
and Nth synthetic waveform, wherein the individual distance 
for each aligned frame is calculated by determining a 
distance between each mean of the corresponding aligned 
frames.

9. A method for rescoring N-best hypotheses of a 
decoded original waveform output from an automatic speech 
recognition system, the N-best hypotheses comprising N text
sequences, the method comprising the steps of: 

generating a synthetic waveform for each of the N
text sequences;

comparing each synthetic waveform with the 
original waveform to determine the synthetic waveform that 
is closest to the original waveform; and 

selecting for output the text sequence 
corresponding to the synthetic waveform determined to be 
closest to the original waveform. 

10. The method of claim 9, wherein the comparing 
step includes the steps of: 

aligning frames of the original waveform and 
frames of each synthetic waveform to a corresponding one of 
the N text sequences; and 

calculating the distance between the original 
waveform and each of the synthetic waveforms based on the 
corresponding alignments. 

11. The method of claim 10, wherein the comparing 
step further includes the steps of: 

retrieving feature vectors corresponding to the 
original waveform; and

generating feature vectors for each synthetic
waveform such that the feature vectors for the synthetic 
waveforms are similar in structure to the feature vectors of 
the original waveform; 

wherein the alignment is performed by 
time-aligning the feature vectors of the original waveform 
and the feature vectors of each synthetic waveform with the 
corresponding one of the N text sequences. 

12. The method of claim 10, wherein the step of 
calculating the distance includes the steps of: 

calculating an individual distance between each 
aligned frame of the original waveform and each of the N 
synthetic waveforms; and 

summing the individual distances of the aligned 
frames of the original waveform and each synthetic waveform. 

13. The method of claim 9, wherein the comparing 
step includes the steps of: 

(a) setting a parameter N=1;

(b) retrieving the Nth synthetic waveform and the 
corresponding Nth text sequence;

(c) time-aligning frames of the original waveform
and frames of the Nth synthetic waveform to corresponding 
text of the Nth text sequence; 

(d) computing an individual distance between each 
corresponding aligned frame of the original and Nth 
synthetic waveform; 

(e) summing the individual distances to compute 
the distance between the original waveform and the Nth 
synthetic waveform; 

(f) determining if the computed distance is less 
than a current best distance value; 

(g) setting the current best distance value equal 
to the computed distance and saving the Nth text sequence 
for consideration as the final output, if the computed 
distance is determined to be less than the current best 
distance threshold; 

(h) incrementing the parameter N by one; and 

(i) repeating steps (b) through (h) until each of 
the N text sequences have been considered. 

14. The method of claim 13, wherein the step of 
determining the individual distance (step d) includes the 
steps of:

computing a mean feature vector of all feature
vectors comprising each aligned frame for both the original
and Nth synthetic waveform, wherein the individual distance
for each aligned frame is calculated by determining a
distance between each mean of the corresponding aligned
frames.



15. An automatic speech recognition system,
comprising:

a decoder for decoding an original waveform of
acoustic utterances to produce N text sequences, the N text
sequences representing N-best hypotheses of the decoded
original waveform;

a waveform generator for generating a synthetic
waveform for each of the N text sequences; and

a comparator for comparing each synthetic waveform
with the original waveform to rescore the N-best hypotheses.

16. The system of claim 15, further comprising a 
feature analysis processor adapted to generate a set of 
feature vectors for the original waveform and generate a set
of feature vectors for each of the N synthetic waveforms
using a similar feature analysis process.

17. The system of claim 15, further comprising a
processor adapted to process one of the original waveform,
the synthetic waveforms, and both, to compensate for
speaker-dependent variations.

18. The system of claim 15, wherein the
comparator comprises:

means for determining the synthetic waveform that
is closest in distance to the original waveform; and

means for outputting the N text sequence
corresponding to the synthetic waveform that is determined
to be closest to the original waveform.

19. The system of claim 18, wherein the means for 
determining the closest synthetic waveform utilizes one of a 
distance score, a language model score, an acoustic model
score, and a combination thereof, for determining the
closest distance. 

20. The system of claim 18, wherein the means for 
determining the closest synthetic waveform comprises: 

means for aligning frames of the original waveform
and frames of each synthetic waveform to a corresponding one
of the N text sequences; and

means for calculating the distance between the
original waveform and each of the synthetic waveforms based 
on the corresponding alignments. 



21. The system of claim 20, wherein the frames 
are aligned on a phoneme level. 

22. The system of claim 20, wherein the means for 
calculating the distance comprises: 

means for calculating an individual distance 
between each aligned frame of the original waveform and each 
of the N synthetic waveforms; and 

means for summing the individual distances of the 
aligned frames of the original waveform and each synthetic 
waveform to compute the distance between the original 
waveform and each synthetic waveform.

SYSTEM AND METHOD FOR RESCORING N-BEST HYPOTHESES
OF AN AUTOMATIC SPEECH RECOGNITION SYSTEM

ABSTRACT OF THE DISCLOSURE 

A system and method for rescoring the N-best 
hypotheses from an automatic speech recognition system by 
comparing an original speech waveform to synthetic speech 
waveforms that are generated for each text sequence of the 
N-best hypotheses. A distance is calculated from the 
original speech waveform to each of the synthesized 
waveforms, and the text associated with the synthesized 
waveform that is determined to be closest to the original 
waveform is selected as the final hypothesis. The original 
waveform and each synthesized waveform are aligned to a 
corresponding text sequence on a phoneme level. The mean of 
the feature vectors which align to each phoneme is computed 
for the original waveform as well as for each of the 
synthesized hypotheses. The distance of a synthesized 
hypothesis to the original speech signal is then computed as 
the sum over all phonemes in the hypothesis of the Euclidean 
distance between the means of the feature vectors of the 
frames aligning to that phoneme for the original and the 
synthesized signals. The text of the hypothesis which is 
closest under the above metric to the original waveform is 
chosen as the final system output. 